Alto Paraná


MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Valentini, Francisco, Cotik, Viviana, Furman, Damián, Bercovich, Ivan, Altszyler, Edgar, Pérez, Juan Manuel

arXiv.org Artificial Intelligence

Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google's autocomplete API and relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.


Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction

Josifoski, Martin, Sakota, Marija, Peyrard, Maxime, West, Robert

arXiv.org Artificial Intelligence

Large language models (LLMs) have great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by LLMs: for problems with structured outputs, it is possible to prompt an LLM to perform the task in the reverse direction, by generating plausible input text for a target output structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging, and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation, and use it to finetune small models (220M and 770M parameters), termed SynthIE, that outperform the prior state of the art (with equal model size) by a substantial margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data, and models are available at https://github.com/epfl-dlab/SynthIE.
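The abstract reports gains in both micro-F1 and macro-F1. As a rough illustration of how the two averages can differ, the sketch below scores predicted (subject, relation, object) triplets against gold triplets. It assumes one common convention (micro pools counts over all triplets, macro averages per-relation F1), which may not match the paper's exact definition; the function names and the toy triplets are hypothetical.

```python
from collections import defaultdict


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Score triplet extraction.

    gold, pred: lists of sets of (subject, relation, object) tuples,
    one set per document. Assumed convention, not taken from the paper:
    micro-F1 pools counts over all triplets, macro-F1 averages the
    per-relation F1 scores.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for triplet in p:
            rel = triplet[1]
            if triplet in g:
                tp[rel] += 1
            else:
                fp[rel] += 1
        for triplet in g:
            if triplet not in p:
                fn[triplet[1]] += 1

    relations = set(tp) | set(fp) | set(fn)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = (
        sum(f1(tp[r], fp[r], fn[r]) for r in relations) / len(relations)
        if relations
        else 0.0
    )
    return micro, macro


# Toy data (hypothetical): a frequent relation is extracted well,
# a rarer one is missed and hallucinated once.
gold = [
    {("A", "born_in", "B"), ("A", "works_for", "C")},
    {("D", "born_in", "E")},
]
pred = [
    {("A", "born_in", "B")},
    {("D", "born_in", "E"), ("D", "works_for", "F")},
]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

In this toy case the frequent relation pulls the micro average up, while the macro average weights the poorly handled rare relation equally, so macro-F1 comes out lower.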


Optimal Crops Selection using Multiobjective Evolutionary Algorithms

Brunelli, Ricardo (National University of Asuncion) | Lücken, Christian von (National University of Asuncion)

AI Magazine

Farm managers have to deal with many conflicting objectives when planning which crop to cultivate. Soil characteristics are extremely important when determining yield potential. Fertilization and liming are commonly used to adapt soils to the nutritional requirements of the crops to be cultivated. Planting the crop that best fits the soil characteristics is an interesting alternative that minimizes the need for soil treatment, reducing costs and potential environmental damage. In addition, farmers usually look for investments that offer the greatest potential earnings with the least possible risk. Depending on the objectives to be considered, the crop selection problem may be difficult to solve using traditional tools. Therefore, this work proposes an approach based on Multiobjective Evolutionary Algorithms to help in the selection of an appropriate cultivation plan, considering five crop alternatives and five objectives simultaneously.
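Multiobjective evolutionary algorithms generally rank candidate solutions by Pareto dominance rather than by a single score. The sketch below shows that core comparison only, not the authors' algorithm. It assumes all objectives are minimized and uses two hypothetical objectives (cost and risk) with made-up values, whereas the work above considers five.

```python
def dominates(a, b):
    """True if plan a Pareto-dominates plan b (all objectives minimized).

    a is no worse than b in every objective and strictly better in at
    least one.
    """
    return all(x <= y for x, y in zip(a, b)) and any(
        x < y for x, y in zip(a, b)
    )


def pareto_front(plans):
    """Return the plans that no other plan dominates, in input order."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


# Hypothetical cultivation plans as (cost, risk) pairs; the values are
# illustrative only and do not come from the paper.
plans = [(3.0, 0.2), (2.0, 0.5), (4.0, 0.1), (3.5, 0.4), (2.0, 0.6)]
front = pareto_front(plans)
print(front)
```

Plans left on the front represent different trade-offs between the objectives; choosing among them is left to the decision maker.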